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Abstract. — In this paper, the performances of the quasi-Newton BFGS algo- 
rithm, the NEWUOA derivative free optimizer, the Covariance Matrix Adapta- 
tion Evolution Strategy (CMA-ES), the Differential Evolution (DE) algorithm 
and Particle Swarm Optimizers (PSO) are compared experimentally on bench- 
mark functions reflecting important challenges encountered in real-world op- 
(yj ' timization problems. Dependence of the performances in the conditioning of 

O^ ■ the problem and rotational invariance of the algorithms are in particular investi- 

gated. 



1 Introduction 

Continuous Optimization Problems (COPs) aim at finding the global optimum 
(or optima) of a real-valued function (aka objective function) defined over a 



in 

o 

Q I (subset of) a real vector space. COPs commonly appear in everyday's life of 

many scientists, engineers and researchers from various disciplines, from physics 
to mechanical, electrical and chemical engineering to biology. Problems such as 
model calibration, process control, design of parameterized parts are routinely 



^ • modeled as COPs. Furthermore, in many cases, very little is known about the 



objective function. In the worst case, it is only possible to retrieve objective 
function values for given inputs, and in particular the user has no information 
about derivatives, or even about some weaker characteristics of the objective 
function (e.g. monotonicity, roughness, . . . ). This is the case, for instance, when 
the objective function is the output of huge computer programs ensuing from 
several years of development, or when experimental processes need to be run in 
order to compute objective function values. Such problems amount to what is 
called Black-Box Optimization (BBO). 



' Invited Paper at the 8 ' International Symposium on Experimental Algorithms, June 3-6, 2009, 
Dortmund, Germany 



Because BBO is a frequent situation, many optimization methods (aka search 
algorithms) liave been proposed to tackle BBO problems, that can be grossly 
classified in two classes: (i) deterministic methods include classical derivative- 
based algorithms, in which the derivative is numerically computed by finite dif- 
ferences, and more recent Derivative Free Optimization (DFO) algorithms H], 
like pattern search [2| and trust region methods [3|; (ii) stochastic methods rely 
on random variables sampling to better explore the search space, and include 
recently introduced bio-inspired algorithms (see Section O. 

However, the practitioner facing a BBO problem has to choose among those 
methods, and there exists no theoretical solid ground where he can stand to per- 
form this choice, first because he does not know much about his objective func- 
tion, but also because all theoretical results either make simplifying hypotheses 
that are not valid for real-world problems, or give results that do not yield any 
practical outcome. Moreover, most of BBO methods require some parameter 
tuning, and here again very little help is available for the practitioner, who is left 
with a blind and time-consuming test-and-trial approach. 

In such context, this paper proposes an experimental perspective on BBO 
algorithms comparisons. Rigorous procedures to compare the results of differ- 
ent BBO algorithms have been proposed H, taking into account the stochastic 
nature of many of them, and giving fair chances to each one of them. How- 
ever, a critical issue in such experiments is that of the benchmark suite. And 
because no set of real-world problems can be guaranteed to cover all possible 
cases of difficult COPs, the approach that has been chosen here is to build artifi- 
cial test functions with some precise characteristics that are known to be possi- 
ble sources of difficulty for optimization (e.g. ill-conditioning, non- separability, 
non-convexity, ruggedness, . . . ). Such experimental results could then be cau- 
tiously generalized, leaving only a few good-performing algorithms in each spe- 
cific context. 

Of course, in real-life BBO situations, it is assumed that nothing is known 
about the objective function. However, the user sometimes has some partial in- 
formation (e.g. because his problem is known to be similar to other better-known 
problems) that might lead him to decide for a BBO method that is (experimen- 
tally) known to perform well, 'in vitro', in his precise situation. But on the other 
hand, assuming absolutely nothing is known in advance about the objective 
function, running the champion algorithms as identified in perfectly controlled 
environment might give him some information about his function (e.g. if numer- 
ical gradient-based algorithms perform 100 times better than all other methods, 
his problem is probably very similar to a quadratic problem). This paper is a 
first step in aiming such 'in vitro' results. 



Next, in Section [2l some characteristics of the objective function are sur- 
veyed that are known to make the corresponding BBO problem hard. Section[3] 
introduces the algorithms that will be compared here. Section |4] then introduces 
the test bench that illustrates the different difficulties highlighted in Section |2l 
as well as the experimental conditions of the comparisons. The results are pre- 
sented and discussed in Section [51 and the paper ends with some conclusions in 
Section [6l 

2 What makes a search problem difficult? 

In this section, we discuss problem characteristics that are especially challeng- 
ing for search algorithms. 

2.1 Ill-conditioning 

The conditioning of a problem can be defined as the range (over a level set) 
of the maximum improvement of objective function value in a ball of small 
radius centered on a given level set. In the case of convex quadratic functions 
{f{x) = jx^Hx where H is a symmetric definite matrix), the conditioning can 
be exactly defined as the condition number of the Hessian matrix H, i.e., the 
ratio between the largest and smallest eigenvalue. Since level sets associated to 
a convex quadratic function are ellipsoids, the condition number corresponds to 
the squared ratio between the largest and shortest axis lengths of the ellipsoid. 

Problems are typically considered as ill-conditioned if the conditioning is 
larger than 10^. In practice we have seen problems with conditioning as large as 
10^°. In this paper we will quantitatively assess the performance dependency on 
the conditioning of the objective function. 

2.2 Non-separability 

An objective function f{xi,...,Xn) is separable if the optimal value for any 
variable x,- can be obtained by optimizing /(xi , . . . ,x,_i ,x,-,x,_|-i , . . . ,x„) for any 
fixed choice of the variables xi , . . . ,x,_i ,x,+i , . . . ,x„. Consequently optimizing 
an n-dimensional separable objective function reduces to optimizing n one- 
dimensional functions. 

Functions that are additively decomposable, i.e., that can be written as /(x) = 
LLi/'(-^() ^^ separable. One way to render a separable test function non-sepa- 
rable is to rotate first the vector x, which can be achieved by multiplying x by an 
orthogonal matrix B: if x i— ;• /(x) is separable, the function x i— ;■ f{Bx) might be 
non-separable for all non-identity orthogonal matrices B. In this paper we will 
investigate separable and non-separable problems. 



2.3 Non-convexity 

Some BBO methods implicitly assume or exploit convexity of the objective 
function. Composing a convex function / G M to the left with a monotonous 
transformation g : M — > M can result in a non-convex function, for instance the 
one-dimensional convex function f{x) = x^ composed with the monotonous 
function g{.) = [.j'/"* becomes the non-convex function \/J7\. In this paper we 
will assess performance dependency on convexity. 

3 Algorithms tested 

This section introduces the different algorithms that will be compared in this 
paper. They have been chosen because they are considered to be the champi- 
ons in their category, both in the deterministic optimization world (BFGS and 
NEWUOA) and in the stochastic bio-inspired world (CMA-ES, DE and PSO). 
They will also be a priori discussed here with respect to the difficulties of con- 
tinuous optimization problems highlighted in previous Section |2] 

3.1 The algorithms 

BFGS is a well-known quasi-Newton (i.e. gradient-based) method: from the 
current point, it computes a 'descent direction' using an approximation of the 
inverse of the Hessian matrix of the objective function applied to its gradient, 
and performs a line-search (ID optimization) along this direction. It then up- 
dates the approximate inverse Hessian. BFGS method is a local method: it has 
a proven convergence to a stationary point. . . provided the starting point is close 
enough from the solution, and the objective function is regular. The Matlab® 
version of BFGS (Matlab function fminunc) will be used here, because it is 
blindly used by many scientists facing optimization problems. Default parame- 
ters were used except for stopping criteria: the algorithms stops if the function 
value improvement in one iteration is less than 10^^^. 

In BBO context, the gradients have to be computed numerically (an option 
in Matlab BFGS), which might be a source of possible numerical problems. 

NEWUOA (NEW Unconstrained Optimization Algorithm) has been proposed 
by Powell |3|: it is a DEO algorithm using the trust region paradigm. The 
trust region is a ball, centered on the current best point. NEWUOA computes a 
quadratic interpolation of the objective function within the current trust region, 
based on known values of the objective, and then performs a truncated conjugate 
gradient minimization of the interpolated model in the trust region. It then up- 
dates either the cuiTcnt best point or the radius of the trust region, based on the a 



posteriori interpolation error, and some thresholds on the trust region size. Here, 

the implementation by Matthieu Guibert posted at http : //www, inrialpes . f r/bipop/people/guilbert/n 

has been used. 

An important parameter of NEWUOA is the quadratic model to use for the 
interpolation, or, equivalently, the number of points that are necessary to com- 
pute the interpolation. As recommended by Powell [3], 2« + 1 points have been 
used here (n is the dimension of the search space). Other critical parameters are 
the initial and final radii of the trust region: the initial radius governs the granu- 
larity of the objective function that the algorithm will 'see' and the final radius 
tunes the amount of local search that will performed. Here the initial and final 
values 100 and 10^^^ were used, after some prehminary experiments. 



CMA-ES is an Evolution Strategy (ES) 115 1611 algorithm: from a set of 'parents' 
(potential solutions), 'offspring' are created by sampling Gaussian distributions, 
and the best of the offspring (according to the objective function values) become 
the next parents. The art of Evolution Strategies lies in the way the parameters 
of the Gaussian distributions are updated: the Covariance Matrix Adaptation Q 
uses the path that has been followed by evolution so far to (i) adapt the step- 
size, a scaling parameter that tunes the granularity of the search, comparing 
the actual path length to that of a random walk; and (ii) modify the covariance 
matrix of the multivariate Gaussian distribution by modifying its eigenvectors 
in order to increase the likelihood of recent beneficial moves. A single Gaussian 
distribution is maintained, its mean being a linear combination of the parents. 
Besides the population size, CMA-ES is parameter-free. The population size 
has been set to its default value 4+ L31og(?i)J, but it needs to be increased in 
order to tackle highly rugged search landscapes. The initial step-size has been 
set to a third of the parameters' range. The version used in this paper (Scilab 
0.92) implements weighted recombination and rank-/j update fSl (version 0.99 
is available at , http: //www. Iri. fr/~hansen/cmaes_inmatlab.html) . 

PSO (Particle Swarm Optimization) [9] is a bio-inspired algorithm that recently 
raised a lot of interest, thanks to several published good results, and the simplic- 
ity of its implementation. The biological paradigm is that of a swarm of particles 
that 'fly' over the objective landscape, exchanging information about the best 
locations (i.e. potential solutions) they have seen. More precisely, each particle 
updates its velocity, stochastically twisting it toward the direction of the best 
positions so far visited by (i) itself and (ii) the whole swarm; it then updates its 
position according to its velocity and computes the new value of the objective 
function. 



A Scilab transcription of the Standard PSO 2006, that is still available on 
the main page of PSO Central .h ttp://www.particleswarm.info/, was used 
here, with default settings. 

Differential Evolution (DE ifTOll ) borrows from Evolutionary Algorithms (EAs) 
a population of potential solutions that evolves subject to objective-function 
based selection. However, the main operator used to generate new solutions, that 
somehow replaces mutation, is specific to DE (and the source for its name): the 
difference between two points in the population is added to a third one. Uniform 
crossover is used with some probability. The implementation posted by the orig- 
inal authors athttp: //www. icsi .berkeley.edu/~storn/code.html was used 
here. However, the authors themselves confess, in their guidance to DE param- 
eter tuning, that the results might be very dependent on the parameters. They 
propose in the code 6 possible settings, and extensive experiments (3 x 288 tri- 
als) on a moderately ill-conditioned problem lead us to consider the ''DE/local- 
to-best/1/bin" strategy, where a single difference vector, computed between a 
random point and the best point in the population, is used to generate the new 
points. In those preliminary experiments, the use of crossover seemed to have 
little beneficial impact on the results, so no crossover was used, thus making 
DE rotationally invariant (see below). Those preliminary experiments also re- 
sulted in values of the other parameters of DE: the population size was set to the 
recommended value of 1 On, a weighting factor to F = 0.8. 

3.2 Invariances 

Some a priori comparisons can be made about those algorithms, related to the 
notion of invariance. Indeed, invariances add to the robustness of an algorithm: 
functions belonging to the same equivalence class with respect to some invari- 
ance property will look exactly the same for an algorithm that is invariant under 
the transformation defining this equivalence class. 

Two sets of invariance properties are distinguished, whether they regard 
transformations of the objective function value or transformations of the search 
space. First, all comparison-based algorithms are invariant under monotonous 
transformations of the objective function, as comparisons are unaltered if the 
objective function / is replaced with some go f for some monotonous function 
g. All bio-inspired algorithms used in this paper are comparison-based, while 
the BEGS and NEWUAO are not (see Section |23]). 

Regarding transformations of the search space, all algorithms are trivially 
invariant under translation of the coordinate system. But let us consider some 
orthogonal rotations: BEGS is coordinate-dependent due to the computation of 



numerical gradients. NEWUOA is invariant under rotation when considering 
the complete quadratic model, i.e. built with ^{n + \){n + 2) points. This vari- 
ant is however often more costly compared to the 2n-\-l one - but the latter is 
not invariant under rotation. The rotational invariance of CMA-ES is built-in, 
while that of DE depends whether or not crossover is used - as crossover re- 
lies on the coordinate system. This was one reason for omitting crossover here. 
Finally, PSO is (usually) not invariant under rotations, as all computations are 
done coordinate by coordinate II11I12II . 

4 Test functions and experimental setup 

4.1 Test functions 

The benchmark functions tested are given in Table [U The functions are tested 

Table 1. Test functions with coordinate-wise initialization intervals and target 
function value, where y := Bx implements an angle-preserving, linear transfor- 
mation, i.e. B is orthogonal. 

Function a Initialization /tai-^et 

/em(x) = LLi «^y? [LlQiO] [-20,80]" 10-" 

/RosenW=Itl'(«(y?->''-+l)' + (>V-l)') [1>10«] [-20,80]" IQ-" 

/ei(l'W=(ELi«^>'?)'^' [1=1010] [-20,80]" 10-" 



in their original axis-parallel version (i.e. B is the identity and y = x), and in 
rotated versions, where y = Bx. The orthogonal matrix B is chosen such that 
each column is uniformly distributed on the unit hypersphere surface Q, fixed 
for each run. 

The ellipsoid function /em is a convex-quadratic function where the parame- 
ter a is the condition number of the Hessian matrix that is varied between 1 and 
10^'' in our experiments. If a = 1 the ellipsoid is the isotropic separable sphere 

1/4 

function. The function /^jjj has the same contour lines (level sets) as f^m, how- 
ever it is neither quadratic nor convex. For a 7^ 1 , the functions /em and fj^^ are 
separable if and only if B = I. 

The Rosenbrock function /Rosen is non-separable, has its global minimum 
at X = [1, 1, . . . , 1] and, for large enough a and n, has one local minimum close 
tox= [—1,1,...,!], see also [fT3l . The contour lines of the Rosenbrock func- 
tion show a bent ridge that guides to the global optimum (the Rosenbrock is 
sometimes called banana function) and the parameter a controls the width of 



the ridge. In the classical Rosenbrock function a equals 100. For smaller a the 
ridge becomes wider and the function becomes less difficult to solve. We vary 
a between one and 10^. 

4.2 Experimental Setup 

For each algorithm tested we conduct 21 independent trials of up to 10^ function 
evaluations. If, for BFGS, no success was encountered, the number of trials was 
extended to 1001. 

We quantify the performance of the algorithms using the success perfor- 
mance SPl used in [14|, analyzed in [15] , and also denoted as Q-measure in 
lfT6l . The SPl equals the average number of function evaluations for success- 
ful runs divided by the ratio of successful experiments, where an experiment is 
successful if the /target is reached before 10^ function evaluations are exceeded. 
The SPl is an estimator of the expected number of function evaluations to reach 
/target if the algorithm is restarted until a success (supposing infinite time hori- 
zon) and assuming that the expected number of function evaluations for unsuc- 
cessful runs equals the expected number of evaluations for successful runs. 

5 Results 

Results are shown for dimension 20. Results for 10 and 40D reveal similar ten- 
dencies and are displayed in Appendix JA] 

Ellipsoid functions: dependencies Figure [T] shows SPl (search costs, expected 
running time in number of function evaluations) versus condition number on all 
ellipsoidal functions. A remarkable dependency on the condition number can be 
observed in most cases. The two exceptions are PSO on the separable functions 
and DE. In the other cases the performance declines by at least a factor of ten 
for very ill-conditioned problems as for CMA-ES. The overall strongest perfor- 
mance decline is shown by PSO on the rotated functions. NEWUOA shows in 
general a comparatively strong decline, while BFGS is only infeasible for high 
condition numbers in the rotated case, reporting some numerical problems. The 
decline of CMA-ES is moderate. 

For CMA-ES and DE the results are (virtually) independent of the given el- 
lipsoidal functions, where CMA-ES is consistently between five and forty times 
faster than DE. For PSO the results are identical on Ellipsoid and Ellipsoid'/"^, 
while the performance decline under rotation (left versus right figures) is very 
pronounced. PSO performs well only on separable or very well-conditioned 
functions. A similar strong decline under rotation can be observed for NEWUOA 
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Fig. 1. All ellipsoidal functions in 20D. Shown is SP\ (the expected running 
time or number of function evaluations to reach the target function value) versus 
condition number. 



on the Ellipsoid function for moderate condition numbers. BFGS, on the other 
hand, shows a strong rotational dependency on both functions only for large 
condition numbers > 10^. 

Switching from Ellipsoid (above) to Ellipsoid^/'* (below) only effects BFGS 
and NEWUOA. BFGS becomes roughly five to ten times slower. A similar ef- 
fect can be seen for NEWUOA on the rotated function. On the separable El- 
hpsoid function the effect is more pronounced, because NEWUOA performs 
exceptionally well on the separable Ellipsoid function. 



Ellipsoid functions: comparison On the separable Ellipsoid function up to a 
condition number of 10^ NEWUOA clearly outperforms all other algorithms. 
Also BFGS performs still better than PSO and CMA-ES while DE performs 
worst. On the separable Ellipsoid'/^ function BFGS, CMA-ES and PSO perform 
similar. NEWUOA is faster for low condition numbers and slower for large 
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Fig, 2. Rosenbrock function. Shown is SPl (the expected running time or num- 
ber of function evaluations to reach the target function value) versus condition- 
ing parameter a. 



ones. For condition number larger than 10^, NEWUOA becomes even worse 
than DE. 

On the rotated functions, the performance of PSO declines fast with increas- 
ing condition number. For numbers larger than 10^, PSO is remarkably outper- 
formed by all other algorithms. On the rotated Ellipsoid function for moderate 
condition numbers BEGS and NEWUOA perform best and outperform CMA- 
ES by a factor of five, somewhat more for low condition numbers, and less for 
larger condition numbers, while PSO and DE are much worse. For large condi- 
tion numbers CMA-ES becomes superior and DE is within a factor of ten of the 
best performance. 

On the rotated Ellipsoid^/'* BFGS and CMA-ES perform similar up to con- 
dition of 10^. NEWUOA performs somewhat better for lower condition num- 
bers up to 10"^. For larger condition numbers BFGS and NEWUOA decline and 
CMA-ES performs best. 

Rosenbrock function On the Rosenbrock function NEWUOA is the best algo- 
rithm (Figure O. NEWUOA outperforms CMA-ES roughly by a factor of five, 
vanishing for very large values for the conditioning parameter a. For small a, 
BFGS is in-between, and for a > \0^ BFGS fails. DE is again roughly ten times 
slower than CMA-ES. Only PSO shows a strong dependency on the rotation of 
the function and it reveals the strongest performance decline with increasing a, 
while it never competes with the best three algorithms. 

Scaling behaviors The scaling of the performance with search space dimension 
is similar for all functions (see Appendix |A] for the data). CMA-ES, NEWUOA 
and PSO show the best scaling behavior. They slow down by a factor between 



five and ten in 40D compared to lOD. For BFGS tlie factor is slightly above 
ten, while for DE the factor is thirty or larger, presumably because the default 
population size increase linearly with the dimension. 

6 Summary 

In this paper we have conducted a comparison of BFGS, NEWUOA, and three 
stochastic bio-inspired optimization methods in a black-box optimization sce- 
nario. The empirical study was conducted on smooth functions with varying 
condition number. Aside from gradients being not provided, we consider these 
functions as the favorite playgrounds of BFGS and NEWUOA. We find that 
NEWUOA performs exceptional on separable quadratic functions, it performs 
in all cases very well with moderate condition numbers, but shows a compara- 
tively steep performance decline with increasing ill-conditioning. BFGS per- 
forms well overall, but shows a strong decline on very ill-conditioned non- 
separable functions. For DE, the parameters are difficult to tune and yet it per- 
forms overall poorly with the single best parameter setting on our small func- 
tion set. With the chosen parameters, DE shows the strongest robustness to ill- 
conditioning though. PSO performs similar to CMA-ES on the separable prob- 
lems, with an even weaker dependency on the conditioning. On non-separable 
problems PSO exhibits a strong performance decline with increasing condition- 
ing and performs very poorly even on moderately ill-conditioned functions. 
Finally, CMA-ES generally outperforms DE and PSO, while up to a moder- 
ate function conditioning BFGS and NEWUOA are significantly faster in most 
cases. Due to their invariance properties, the performance results of CMA-ES 
and DE are the most stable ones and most likely to generalize to other functions. 
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Fig. 3. Ellipsoid function. Shown is SP\ (the expected running time or num- 
ber of function evaluations to reach the target function value) versus condition 
number. 
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Fig. 4. Ellipsoid^/'^ function. Shown is SP\ (the expected running time or num- 
ber of function evaluations to reach the target function value) versus condition 
number. 



Rosenbrock Function 



































^^^-^-l-^^^-^-^-s::Np¥ 


■:::::::::::::::::::: 




h^::^:::::i^^^^^^^^^^^^ 


^:":::::":::::::::: 


;i ^^ ^i/" 




/ /-<^ 






^;;*r:^: :::::::;:::: :j^^;|;>^ ::: iji^ 








"^/'^ .1-^ */ 


/ ^--^l-*'-'-'^ 




^:^:?-^ 


■nii- NEWUOA 

-0-EFGS 

-9-DE2 

-X-PSO 

H-CMAES 

















alpha 






Rotated Rosenbrock Function 










1 '■ 




^:;:;::::;::::;:;::;: 




:::j"::j:j^fj"::X--- ------- ---^^^^ 


t::::::::":::::"::" 








D : 


















3 : 


-i- NEWUOA 

-0-BFGS 

-®-DE2 

-K-PSO 

H-CMAES 


S 







10 


10 


10 


10 

alpha 


10 10 


) 


::::;::::;:;::;:;:::^:;:::;:;::;:;::::js::;:;::;:;::::;::::;;:::;:;::;:;:::::^^ 










:ar^-:-::±:-:-:-::-:-::-:-::^ 
















P'lr 








y^ 


Z^ 




y 






















r^*^ 7 ^^^^^„:jf^ ^ 


n : 




-^ 


i^ 






X 








^ 












; 








-■i- NEWUOA 
^)-BFGS 

DE2 
^*-PSO 
H-CMAES 










o" 


\ 


\ 


6 




e 1 





■■■■■■■■■■■■|P^"-' 


i:::::::::::::::::::: 




^=t^^'^'^^'^W^^'^^^ 






p^ 






>X j,''^ \y^ 






































■n^ NEWUOA 
-i^BFGS 

DE2 
-><-PSO 
H-CMAES 








S 4 6 


i'c 



10 10 



; 


^^^^^^30&M=^^^=^^^=. 


^:::::::::::::::::::: 










^ft^-^'^^'i / .-'t'^ 




; 


9mmmm^^mmm 












—^f^^^^^c^ 






P^iJ^^^^^^^^ 






^^ ^-^ 




: 


mgmsmsm 


-A- NEWUOA 
-C^BFGS 

DE2 
■^<-PSO 
H-CMAES 








2 4 6 





Fig. 5. Rosenbrock function. Shown is SP\ (the expected running time or num- 
ber of function evaluations to reach the target function value) versus condition- 
ing parameter a. 



